Abstract

Data points are placed in bins when a histogram is created, but
there is always a decision to be made about the number or width of
the bins. This decision is often made arbitrarily or subjectively, but it
need not be. A jackknife or leave-one-out cross-validation likelihood is
defined and employed as a scalar objective function for optimization
of the locations and widths of the bins. The objective is justified as
being related to the histogram's usefulness for predicting future data.
The method works for data or histograms of any dimensionality.

1 Introduction



There are many situations in experimental science in which one is presented
with a collection of discrete measurements $x_j$ and one must bin those points
into a set of finite-sized bins $i$, with centers $X_i$ and full-widths $\Delta_i$, to create
a histogram of numbers of points $N_i$, or the equivalent when the points have
non-uniform weights $w_j$. The problem of binning comes up, for example,
when one needs to plot a data histogram, when one needs to perform least-squares
fitting of a probability distribution function, and when one wants to
compute entropies or other measurements on the inferred data probability
distribution function.

The choice of bin centers and widths often seems arbitrary. However,
there is a non-arbitrary choice, derived below, which emerges when the histogram
is thought of as an estimate of the probability distribution function
of whatever process generated the data. If the binning is too coarse, the histogram
does not give much information about the shape of the probability
distribution function. If the binning is too fine, bins become empty and the
histogram becomes noisy, so it in some sense "overfits" the data. The best
binning lies in between these extremes and can be found simply and quickly
by a "jackknife" or cross-validation method, that is, by excluding data subsamples
and using the non-excluded data to predict the excluded data. This
is not the only data-based binning-choice approach (see footnote 2), but it is simple and
sensible.

In what follows, we are going to consider a data histogram, which we
imagine as a set of bins $i$, with centers $X_i$ and widths (or multi-dimensional
volumes) $\Delta_i$. Equivalently (and perhaps more usefully), the parameterization
of the bins can be described by a set of edges $X_{i-1/2}$, so the centers
become $X_i = [X_{i-1/2} + X_{i+1/2}]/2$ and the widths become
$\Delta_i = X_{i+1/2} - X_{i-1/2}$. These bins will get filled by a set of (possibly
multi-dimensional) data points $x_j$, leading to each bin $i$ containing a number of
data points $N_i$. We will also make reference to the binning function $i(x)$
which, for a given data value $x$, returns the bin $i$.

2 Model probability distribution function 

Our best binning is based on the idea that the histogram is a sampling of a 
probability distribution function and can therefore be thought of as providing 
an estimate or model of that probability distribution function. 

One possible (approximate) probabilistic model for the data is that they 
are drawn from a probability distribution function such that, in each bin of 
the histogram we are making, the probability is constant and proportional to 
the number of actual data points that landed (by chance) in that bin. This 
model has the limitation that bins that happen (by chance) to be empty 
will be assigned zero probability; when a new datum happens to arrive (by 
chance) inside one of those previously empty bins, it will be assigned a van- 
ishing likelihood and render the probabilistic model false at (arbitrarily) high 
confidence. 

Footnote 2: See, for example, Knuth, K. H., "Optimal data-based binning for histograms," arXiv:physics/0605197, and references cited therein.

A more well-behaved (approximate) probabilistic model is that the probability
$p(i)$ that a data point lands in bin $i$ is

$$p(i) = \frac{N_i + \alpha}{\sum_k [N_k + \alpha]} , \tag{1}$$

where $\alpha$ is a dimensionless "smoothing" constant of order unity (to be set
later). Here, so long as there is a finite number of bins, the probability
is non-zero in every bin. The associated (approximate, model) probability
distribution function is

$$f(x) = \frac{p(i(x))}{\Delta_{i(x)}} , \tag{2}$$

where $i(x)$ is the function that returns the bin $i$ for any value $x$. Note that
the function $f(x)$ is normalized by construction:

$$\int f(x)\,dx = 1 . \tag{3}$$

In general, the data points will not all be treated equally; instead, each
data point $x_j$ will come with a weight $w_j$, and each bin $i$ will contain total
weight $W_i$. The only change this makes is in the inferred probability $p(i)$,
which becomes

$$p(i) = \frac{W_i + \alpha}{\sum_k [W_k + \alpha]} , \tag{4}$$

where now the smoothing constant $\alpha$ will be of order the mean of the weights $w_j$.
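
As a minimal sketch of the model just described, the following Python computes the smoothed bin probabilities and the piecewise-constant density for a toy data set. The function names (`bin_index`, `bin_probabilities`, `model_pdf`), the representation of bins by a sorted list of edges, and the toy numbers are all assumptions made for illustration; they are not from the text.

```python
from bisect import bisect_right

def bin_index(x, edges):
    """Binning function i(x): index of the bin [edges[i], edges[i+1]) containing x.
    Assumes edges is sorted and edges[0] <= x < edges[-1]."""
    return bisect_right(edges, x) - 1

def bin_probabilities(points, weights, edges, alpha):
    """Smoothed probabilities p(i) = (W_i + alpha) / sum_k (W_k + alpha)."""
    totals = [0.0] * (len(edges) - 1)      # total weight W_i in each bin
    for x, w in zip(points, weights):
        totals[bin_index(x, edges)] += w
    norm = sum(totals) + alpha * len(totals)
    return [(W + alpha) / norm for W in totals]

def model_pdf(x, edges, probs):
    """Piecewise-constant density f(x) = p(i(x)) / Delta_{i(x)}."""
    i = bin_index(x, edges)
    return probs[i] / (edges[i + 1] - edges[i])

# Toy data (illustrative only): three points in the first bin, one in the
# second, none in the third.
points = [0.1, 0.2, 0.25, 0.7]
weights = [1.0, 1.0, 1.0, 1.0]
edges = [0.0, 0.5, 0.75, 1.0]
probs = bin_probabilities(points, weights, edges, alpha=1.0)
```

With unit weights this reduces to the unweighted model. The empty third bin still receives non-zero probability, and the probabilities sum to one, so the density integrates to one over the binned range.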



3 Jackknife likelihood 

Imagine now that a new datum is recorded and happens to fall in bin $i$.
The (logarithmic) likelihood of this new datum (according to the approximate
model) is just $\ln f(x)$. If the binning is extremely fine ($\Delta_i$ small),
then most bins will be empty and assigned identical probabilities. If the
binning is extremely coarse ($\Delta_i$ large), then although most bins will have
high probabilities, they will not have large values of $f(x)$ because they will
have large widths. In either case, the predictive power of the model probability
distribution function $f(x)$ is low. For most well-behaved continuous
(true) probability distribution functions, there is a finite binning at which
the likelihoods of new data are maximized.



With a finite data set, a "jackknife" or leave-one-out cross-validation
likelihood $L$ can be defined to be the total weighted (logarithmic) likelihood
of each data point $x_j$ as computed from the model probability distribution
function $f(x)$ computed from all the data points other than point $j$:

$$L \equiv \sum_j w_j \ln\left(\frac{W_{i(x_j)} + \alpha - w_j}{\Delta_{i(x_j)} \left(\sum_k [W_k + \alpha] - w_j\right)}\right) , \tag{5}$$

where, again, $i(x_j)$ is the function that returns the bin $i$ containing the data
point $x_j$.
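
The leave-one-out construction can be sketched point by point: remove the weight of point $j$ from its own bin and from the normalization, evaluate the smoothed model density at $x_j$, and accumulate the weighted logarithm. The function name, the argument layout (a precomputed list of bin indices), and the toy input are assumptions for illustration only.

```python
import math

def jackknife_loglike(bin_of_point, weights, widths, alpha):
    """Leave-one-out log-likelihood, summed over data points.

    bin_of_point[j] is the bin index i(x_j), weights[j] is w_j, and
    widths[i] is Delta_i. Assumes alpha > 0 so that a point alone in its
    bin never yields log(0).
    """
    totals = [0.0] * len(widths)           # total weight W_i per bin
    for i, w in zip(bin_of_point, weights):
        totals[i] += w
    norm = sum(totals) + alpha * len(widths)
    L = 0.0
    for i, w in zip(bin_of_point, weights):
        # Model probability of bin i built from all points except this one.
        p = (totals[i] - w + alpha) / (norm - w)
        L += w * math.log(p / widths[i])
    return L

# Toy input (illustrative only): two points in bin 0, one point in bin 1.
L = jackknife_loglike([0, 0, 1], [1.0, 1.0, 1.0], [0.5, 0.5], alpha=1.0)
# The two points in bin 0 each see density 1 (log 0); the lone point in
# bin 1 sees density 0.5, so L equals ln(0.5).
```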

In the simple case of no weighting (or, equivalently, $w_j = 1$ for all $j$), this
jackknife likelihood can be written as

$$L = \sum_i N_i \ln\left(\frac{N_i + \alpha - 1}{\Delta_i \left(\sum_k [N_k + \alpha] - 1\right)}\right) , \tag{6}$$

where the sum over data points has been converted into a sum over bins,
because the latter is generally far faster.

The "best" binning parameters $X_i$, $\Delta_i$, and $\alpha$ are those that maximize
the jackknife likelihood $L$. This defines a non-arbitrary choice of binning.
The choice is also motivated; it is the choice that best predicts future data, 
under the assumption that the existing data are representative. 
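
For the unweighted case, the sum over bins can be sketched directly from the bin counts: every one of the $N_i$ points in bin $i$ contributes the same term, so each occupied bin is visited once. The function name and the toy counts are assumptions for illustration.

```python
import math

def jackknife_loglike_binned(counts, widths, alpha):
    """Unweighted leave-one-out log-likelihood as a sum over bins.

    counts[i] is N_i and widths[i] is Delta_i. Assumes alpha > 0 so that
    bins holding a single point do not produce log(0).
    """
    # Normalization with one point removed: sum_k (N_k + alpha) - 1.
    norm = sum(counts) + alpha * len(counts) - 1.0
    L = 0.0
    for n, width in zip(counts, widths):
        if n > 0:  # empty bins contain no points, so they add nothing
            L += n * math.log((n + alpha - 1.0) / (width * norm))
    return L

# Toy counts (illustrative only): two points in bin 0, one in bin 1.
L_binned = jackknife_loglike_binned([2, 1], [0.5, 0.5], alpha=1.0)
```

The cost scales with the number of bins rather than the number of data points once the counts are known, which is why this form suits a search over many candidate binnings.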



4 Examples 

As a simple test, consider equal-width (all $\Delta_i$ equal) binnings of a set of
(one-dimensional) measurements $x_j$ (galaxy colors in this case), in a fixed
range $x_{\min} < x < x_{\max}$. In this simple situation, the binning only has two
parameters: the number $N$ of bins (which, given the color range, fixes the
bin positions $X_i$ and common widths $\Delta_i = \Delta$) and the smoothing $\alpha$. The
binning function $i(x)$ is then simply

$$i(x) = \left\lfloor \frac{x - x_{\min}}{\Delta} \right\rfloor , \tag{7}$$

where

$$\Delta = \frac{x_{\max} - x_{\min}}{N} . \tag{8}$$

Figure 1 shows the results of a grid search in this parameter space for the
optimum binning for the ${}^{0.1}[g-r]$ colors of a large number of galaxies, and
the same for a smaller subsample.
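
A grid search of this kind might be sketched as follows. The data here are synthetic stand-ins drawn from two Gaussian clumps, not the galaxy colors; the floor-based bin index, the function names, the grid ranges, and the random seed are all assumptions for illustration.

```python
import math
import random

def equal_width_counts(points, x_min, x_max, n_bins):
    """Counts N_i for n_bins equal-width bins covering [x_min, x_max)."""
    width = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for x in points:
        # Floor-based bin index (an assumption); clamp guards round-off.
        i = min(int((x - x_min) / width), n_bins - 1)
        counts[i] += 1
    return counts, width

def jackknife_loglike(counts, width, alpha):
    """Unweighted leave-one-out log-likelihood summed over occupied bins."""
    norm = sum(counts) + alpha * len(counts) - 1.0
    return sum(n * math.log((n + alpha - 1.0) / (width * norm))
               for n in counts if n > 0)

def grid_search(points, x_min, x_max, n_bins_grid, alpha_grid):
    """Return (L, n_bins, alpha) for the grid point with the largest L."""
    best = None
    for n_bins in n_bins_grid:
        counts, width = equal_width_counts(points, x_min, x_max, n_bins)
        for alpha in alpha_grid:
            L = jackknife_loglike(counts, width, alpha)
            if best is None or L > best[0]:
                best = (L, n_bins, alpha)
    return best

# Synthetic stand-in data (NOT the galaxy colors), kept inside [0, 1).
rng = random.Random(0)
points = []
for _ in range(2000):
    x = rng.gauss(0.3, 0.05) if rng.random() < 0.5 else rng.gauss(0.7, 0.1)
    if 0.0 <= x < 1.0:
        points.append(x)

alpha_grid = [10.0, 1.0, 0.1, 0.01]
best_L, best_n, best_alpha = grid_search(points, 0.0, 1.0, range(1, 101), alpha_grid)
```

A brute-force grid is used because the likelihood is not differentiable in the bin parameters. A single bin spanning the whole range always gives $L = 0$, which provides a convenient baseline for the search.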

There is nothing special (except simplicity) about the one-dimensional
case. Figure 2 shows the results of a grid search for the optimal two-dimensional
equal-width binning for two quantities (the ${}^{0.1}[g-r]$ colors and
Sérsic indices $n$ of the same set of galaxies). In the two-dimensional case, the
optimal binning is coarser (because the space is "bigger").
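
The two-dimensional case changes little in code: the bin "width" becomes the cell area, and the counts are indexed by a pair of bin indices. The sketch below uses synthetic points in the unit square, not the galaxy sample; the function name, the dictionary of occupied cells, the grid range, and the fixed smoothing of 1.0 are assumptions for illustration.

```python
import math
import random

def jackknife_loglike_2d(points, lo, hi, n_bins, alpha):
    """Leave-one-out log-likelihood for an equal-width two-dimensional grid.

    lo, hi, and n_bins are (x, y) pairs; each cell has area wx * wy.
    Assumes alpha > 0 and that every point lies inside [lo, hi).
    """
    wx = (hi[0] - lo[0]) / n_bins[0]
    wy = (hi[1] - lo[1]) / n_bins[1]
    counts = {}                            # occupied cells only
    for x, y in points:
        ix = min(int((x - lo[0]) / wx), n_bins[0] - 1)
        iy = min(int((y - lo[1]) / wy), n_bins[1] - 1)
        counts[(ix, iy)] = counts.get((ix, iy), 0) + 1
    n_cells = n_bins[0] * n_bins[1]        # smoothing applies to every cell
    norm = len(points) + alpha * n_cells - 1.0
    area = wx * wy
    return sum(n * math.log((n + alpha - 1.0) / (area * norm))
               for n in counts.values())

# Synthetic stand-in data (NOT the galaxy sample): one clump in the unit square.
rng = random.Random(1)
points = []
while len(points) < 1000:
    x, y = rng.gauss(0.5, 0.1), rng.gauss(0.5, 0.1)
    if 0.0 <= x < 1.0 and 0.0 <= y < 1.0:
        points.append((x, y))

lo, hi = (0.0, 0.0), (1.0, 1.0)
grid = [(nx, ny) for nx in range(1, 16) for ny in range(1, 16)]
best = max(grid, key=lambda n: jackknife_loglike_2d(points, lo, hi, n, 1.0))
best_L = jackknife_loglike_2d(points, lo, hi, best, 1.0)
```

Storing only occupied cells keeps the cost manageable as the number of cells grows, while the smoothing term in the normalization still accounts for every cell, empty or not.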

5 Discussion 

I have shown that when a histogram of data needs to be made, there is a 
non-arbitrary choice of binning. Some qualitative observations follow. 

• The optimal bin widths get smaller as the number of data points goes 
up or as the features in the (true) probability distribution function get 
narrower. 

• The results are more sensitive to the smoothing parameter $\alpha$ when the
number of empty or near-empty bins becomes significant.

• The jackknife likelihood makes discontinuous jumps as the bin edges 
cross individual data points. For this reason, the likelihood does not 
have well-defined derivatives. Some care must be taken that the op- 
timization technique does not depend on having a differentiable likeli- 
hood function. 

• There is nothing special about one-dimensional or two-dimensional distribution
functions; this is easily generalized to $n$-dimensional distributions.
However, it takes a lot of data points to measure a distribution
function in $n$ dimensions when $n$ is large; I understand that the required
number of data points scales worse than $e^n$ [need ref].

• There is nothing special about equal-width binning; I simply chose this 
to make the optimization problem easily tractable and the results easily 
presentable. 



• This method makes no reference to the errors or uncertainties on the
measurements $x_j$. Effectively, I have assumed that the errors are small
relative to any real features in the probability distribution function. In
practice, it is rarely useful to have more than a few bins per width
of the error distribution, if all the points have similar uncertainties.

• There is often an additional choice about what minimum and maximum 
data values to allow for histogramming. This choice also ought to be 
made in a non-arbitrary fashion if there are data points that will be 
excluded by the choice. 

• Finally, there is nothing special about the "tophat" binning model used 
in the above examples. Everything can be generalized to smoothly 
overlapping bins, in which points are assigned fractionally to multiple 
bins. In general, smoother binning models make for more well-behaved 
derivatives of the jackknife likelihood and therefore more straightfor- 
ward optimization. This can also all be generalized to kernel-smoothing 
techniques for density estimation, which ought to be made the subject 
of a separate note. 
Figure 1: Constant-width binning of a set of measured galaxy colors. The
top-left panel shows grid searches in binsize for the eight possible combinations
of smoothing $\alpha = (10, 1, 0.1, 0.01)$ and binning phase $\delta = (0, 0.5)$
(see text for definitions). The top-right panel shows the data binned with
the maximum-likelihood binning parameters. The bottom panels show the
same, but for a randomly chosen subsample.



[Figure 1 panels: 183487 points (top) and 1000 points (bottom). Left panels: horizontal axis is log10 number of bins in ${}^{0.1}[g-r]$. Right panels ("best binning"): horizontal axis is ${}^{0.1}[g-r]$.]



Figure 2: Two-dimensional constant-width binning of a set of measured
galaxy colors and radial profile shapes (as parameterized by the Sérsic index
$n$). The top-left panel shows a grid search in the two binsizes, with smoothing
fixed at $\alpha = 1.0$ and both phases fixed at $\delta = 0$. The top-right panel shows
the data binned with the maximum-likelihood binning parameters, plus contours
at 2, 10, 25, 50, and 75 percent of the maximum value. The bottom
panels show the same, but for a randomly chosen subsample.



